Abstract- Any deep computational processing of a
language requires evidence, and one such source of
evidence is a corpus. This paper describes the development
of a text-based corpus for the Bishnupriya Manipuri
language. A corpus is considered a building block for any
language processing task. Owing to a lack of awareness,
Bishnupriya Manipuri, like several other Indian languages,
is studied less frequently. As a result, the language still
lacks a good corpus and basic language processing tools.
To the best of our knowledge, this is the first effort to
develop a corpus for the Bishnupriya Manipuri language.
I. INTRODUCTION 


A corpus is the source of data for research on
natural language processing applications. A corpus
may contain text in a single language (monolingual
corpus) or in multiple languages (multilingual corpus).
Newspapers and books are the most commonly used
resources for developing a corpus. Nowadays, free
resources on the Internet are also widely used as a source.
A corpus should be balanced so that it covers the
properties of the different kinds of text it contains, and
it should be a good representative of the language.


This paper reports the development of a corpus of
the Bishnupriya Manipuri language and the results of
analysing the data from the corpus. Section II gives an
outline of the Bishnupriya Manipuri language. Section
III describes the corpus development. The corpus is then
analysed in Section IV, which also gives the word
analysis of the Bishnupriya Manipuri language obtained
from our corpus. Section V gives the analysis of word
frequency. Section VI concludes the study, followed by
the references.


II. THE BISHNUPRIYA MANIPURI LANGUAGE 


The Manipuris are divided into two groups, namely
the Meiteis and the Bishnupriyas. The Bishnupriyas are
of Aryo-Mongolo-Dravidian origin and the Meiteis are
of Mongolian origin. The Bishnupriyas are generally
dark in complexion, whereas the Meiteis are generally
yellowish in complexion. The Meiteis call their language
‘Meitei’ or ‘Manipuri’, which is the state language of
Manipur. Formerly, the Bishnupriyas called their
language simply ‘Manipuri’, but they now call it
‘Bishnupriya Manipuri’ to distinguish it from Meitei.
The Bishnupriya Manipuri language is of Indo-Aryan
origin and is akin to Oriya, Assamese and Bengali. The
Meitei language, on the other hand, belongs to the
Kuki-Chin branch of the Tibeto-Burman group of
languages. Although the two languages differ in many
respects, these two sections of people share a common
stock of culture: their kirtana, dances, music, dress,
etc. are all of the same type. There is no bar to
matrimonial relations between these sections of people,
but in practice such marriages are rare.


The Meitei language, though grouped under the
Tibeto-Burman group of languages, has absorbed a
number of words from the neighbouring group of
Indo-Aryan languages. The Bishnupriya Manipuri
language, on the other hand, though of Indo-Aryan
origin, has incorporated numerous words (about 4,000)
from Meitei [1].


The Bishnupriya Manipuri language was originally
confined to the surroundings of Loktak Lake in
Manipur. The principal localities where this language
was spoken are now known as Khangabok, Heirok,
Mayang-Yumphan, Bishnupur, Khunau,
Ningthaukhong, Thamnapokpi and other places. The
people of these places are known as Bishnupriyas even
now, and resemble the Bishnupriyas living outside
Manipur in appearance and complexion. They, however,
neither speak nor understand Bishnupriya; they all
speak Meitei. Formerly, Bishnupriya Manipuri speakers
were very numerous in the localities mentioned, a major
portion of which was included in the Khumal kingdom.
But when a great majority of these people fled from
Manipur for different reasons, it was very difficult for
the few who remained there to retain their language in
the face of Meitei, which was spoken by the majority.
They gradually began to forget their language and
assimilated with the speakers of the dominant language,
Meitei. There are two dialects of the Bishnupriya
Manipuri language, namely the Madai Gang dialect, or
the dialect of the village of the queen, and the Rajar
Gang dialect, or the dialect of the village of the king.
The Madai Gang dialect is also known as Leimanai and
the Rajar Gang dialect as Ningthaunai. The term
Leimanai is derived from leima (queen) + (ma)nai
(attendant), meaning the attendants of the queen, and
the word Ningthaunai from ningthau (king) + (ma)nai
(attendant), meaning the attendants of the king. Unlike
the dialects of other tribes, these dialects of
Bishnupriya are not confined to distinct geographical
areas; rather, they exist side by side in the same
localities [2].


The Bishnupriya Manipuri language is practically
dead in its place of origin. However, the language is
retained by its speakers in the diaspora, mostly in
Assam, Tripura and Bangladesh. The language is still
listed as an endangered language by UNESCO [3]. A
thorough linguistic study of this little-known language
was carried out by Dr. K. P. Sinha, mainly on
morphological analysis. An etymological dictionary has
also been published [4]. However, no study of the
computational linguistics of this language has been
done. To the best of our knowledge, the present work,
based on the available publications, is the first of its
kind in this direction.


III. CORPUS DEVELOPMENT 


A corpus (plural corpora) or text corpus is a large
collection of texts. Corpora are used for statistical
analysis and for checking linguistic rules within a
specific language domain.


A corpus may contain texts in a single language 
(monolingual corpus) or text data in multiple languages 
(multilingual corpus). Multilingual corpora that have 
been specially formatted for side-by-side comparison 
are called aligned parallel corpora. 


To make corpora more useful for linguistic research,
they are often subjected to a process known as
annotation. An example of annotating a corpus is
POS-tagging, in which information about each word's
part of speech (verb, noun, adjective, etc.) is added to
the corpus in the form of tags. When the language of the
corpus is not widely known, the corpus is annotated
bilingually.
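
As a minimal sketch of the POS-tagging annotation described above, each token can be paired with a tag and written out in a `word/TAG` form. The toy lexicon, the English sample words and the tag names (`DET`, `NOUN`, `VERB`, `UNK`) are illustrative assumptions only; they are not a tag set or a tool defined in this paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: words and tag names are placeholders for illustration.
TOY_LEXICON: Dict[str, str] = {
    "the": "DET",
    "man": "NOUN",
    "eats": "VERB",
    "rice": "NOUN",
}


def annotate(tokens: List[str], lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair each token with its tag; unknown tokens get the placeholder tag 'UNK'."""
    return [(tok, lexicon.get(tok.lower(), "UNK")) for tok in tokens]


def to_tagged_text(pairs: List[Tuple[str, str]]) -> str:
    """Serialize (word, tag) pairs in the common word/TAG form."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


if __name__ == "__main__":
    pairs = annotate("The man eats rice".split(), TOY_LEXICON)
    print(to_tagged_text(pairs))
```

A real annotation effort would replace the lookup table with a tagger and a tag set chosen for the language; the sketch only shows how tags sit alongside the words of the corpus.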


Some corpora are further analysed at structural
levels. In particular, a number of smaller corpora may be
fully parsed. Such corpora are called parsed corpora or
treebanks. These corpora are completely and
consistently annotated, and are usually smaller,
containing 1 to 3 million words. Corpora can be analysed
further still, with annotations for morphology,
pragmatics, etc.


Corpora are considered the main knowledge base
for any language processing task. Frequency lists of
words drawn from corpora are useful in linguistic work.
They are also useful in language teaching. The analysis
and processing of various types of corpora are widely
carried out in speech recognition and machine
translation. Corpora are very helpful in learning and
writing a foreign language, because they allow
non-native language users to grasp the manner of
sentence formation in the foreign language.


Feature Analysis: 


While developing a corpus, we should consider some
features. The first is the mode of the text data, that is,
whether the data collected originate in speech or in
writing. The second is the data samples collected for the
corpus. Entire documents or transcriptions of speech
events might be considered for the corpus, but the
samples should be of a suitable size so that the system
can process the data. The third is that the data samples
collected for the corpus should be good enough to
represent the language. The last is that the corpus
should be balanced: it should cover different kinds of
texts with all their properties.


It has already been realized that free resources on
the Internet, such as blogs, Wikipedia and web pages,
can be used for developing a large corpus, although such
materials can create problems when they are collected
from different sources. This method, however, cannot
help much for the Bishnupriya Manipuri language, as a
very small number of people use this language and very
few resources are available on the Internet. Moreover,
most of the resources present on the Internet are in
graphic format, which cannot be used in this language
processing work.


We have collected many texts from the Bishnupriya
Manipuri version of Wikipedia [5]. Though it is very
small, it can still be used as a corpus. Further, we have
collected a large number of texts written in the Smriti
font, which is an ASCII font. For this reason, we have
built an ASCII-to-Unicode converter, which converts
ASCII-encoded texts to UTF-8 texts. With the help of
this converter, we have converted approx. 45,000 words
from different texts. Presently, this corpus contains
approx. one lakh (100,000) words in 10,196 sentences.
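
The paper does not describe how the converter works internally, so the following is only a rough sketch of how such a legacy-font-to-Unicode conversion is commonly structured: a character mapping table followed by a reordering step, since legacy Bengali-script fonts typically store left-side vowel signs before the consonant while Unicode stores them after it. The mapping table below is hypothetical; it is not the actual Smriti font table, and the function names are assumptions made for illustration.

```python
import unicodedata

# HYPOTHETICAL mapping from legacy (ASCII-font) characters to Unicode Bengali.
# The real Smriti font table is not given in the paper; these pairs are placeholders.
LEGACY_TO_UNICODE = {
    "k": "\u0995",  # letter KA
    "m": "\u09AE",  # letter MA
    "n": "\u09A8",  # letter NA
    "v": "\u09BE",  # vowel sign AA
    "w": "\u09BF",  # vowel sign I (stored BEFORE the consonant in legacy text)
}

# Vowel signs that are drawn to the left of the consonant they follow in Unicode order.
PRE_BASE_SIGNS = {"\u09BF", "\u09C7", "\u09C8"}


def is_bengali_consonant(ch: str) -> bool:
    """Rough range check for Bengali consonant letters (U+0995 to U+09B9)."""
    return "\u0995" <= ch <= "\u09B9"


def convert(legacy_text: str) -> str:
    """Map legacy characters to Unicode and move pre-base vowel signs after the consonant."""
    mapped = [LEGACY_TO_UNICODE.get(ch, ch) for ch in legacy_text]
    out = []
    i = 0
    while i < len(mapped):
        ch = mapped[i]
        nxt = mapped[i + 1] if i + 1 < len(mapped) else ""
        if ch in PRE_BASE_SIGNS and nxt and is_bengali_consonant(nxt):
            out.extend([nxt, ch])  # swap: consonant first, then the vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    unicode_text = convert("wmwn")
    utf8_bytes = unicode_text.encode("utf-8")  # bytes ready to be written to a UTF-8 file
    print(unicode_text, len(utf8_bytes))
```

Characters without an entry in the table pass through unchanged, which makes unmapped glyphs easy to spot in the output. Conjuncts and other multi-character sequences would need additional rules that this sketch does not attempt.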


IV. WORD ANALYSIS 


We have analysed the word structure of the
Bishnupriya Manipuri language using the data of the
corpus. Some of our results are shown below:


1) NOUN: 
Gender Suffixes: 


A. In the case of words indicating human beings, the
words 31 (muni) and caret (dela) are used before the
word to indicate masculine and feminine genders
respectively.

E.g. - aferry (munimanu : man), CB (elamanu : woman)

B. The feminine gender is generally indicated by the
use of the word Cail (dela) after the words indicating
common gender.

C. Feminine gender is formed by adding the 
following suffixes to the masculine forms of words: 


i) at (i): {Wt (k"usa : father’s younger brother) -> act 
(k"usi : the wife of father’s younger brother), Ceara 
(&et"aba : father’s elder brother) -> CRA (d&et*ima : 
the wife of father’s elder brother) 

ii) <lelt (ani): DIA (sakor : servant) -> DIPS (sakoszani 
: maid servant) 

iii) Vt (ni): DITA (samaz : cobbler) -> DITA (samasani :
female cobbler)


Number suffixes: 


A. The word “if (gafi) compounded with the preceding
stem, which is invariably a noun of relationship,
bears a plural sense.


E.g. - waite (dadagafi : elder brothers), Watate
(mamagafi : maternal uncles)


B. fo (i) : This suffix is added to the singular forms
in -7T (g) and 2I (han), resulting in fr (gi) and aif
(hani) respectively, in the sense of ‘a small number of’.
E.g.- Wal" (mafogi : the few fishes)

C. WBZ (mahei) : This suffix is added in the sense 
of ‘a large number of’. 


E.g.- SPAS (guiumahei : many cows) 
D. In some cases, the plural sense is carried by the 
addition of adjectives bearing plural sense: 


i) eft (guli): It is used as an adjective in the sense of ‘a 
small number of’. 

E.g. — Weel (mafoguli : a few fishes) 

ii) BI& (habi): It is used before or after the stem as an 
adjective in the sense of ‘all’. 

E.g. - *1atfe (manuhabi : all men), fF (habimanu :
all men)


iii) MST (eta), SS! (outa): These are used after the stem
as adjectives in the sense of ‘these’ and ‘those’
respectively.

E.g. - WJ Mol (manueta : these people)
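
The plural markers listed above can be sketched as a simple lookup on romanized word forms. The marker strings (`mahei`, `guli`, `habi`, `eta`, `outa`) and the sample words are taken from the romanized forms as printed in this section; the function name and the plain string-matching strategy are assumptions made for illustration, and a real analyser would need a stem lexicon to avoid false matches.

```python
from typing import Optional, Tuple

# Romanized markers as printed in the text, mapped to the senses stated there.
SUFFIX_MARKERS = {
    "mahei": "a large number of",
    "guli": "a small number of",
    "habi": "all",
    "eta": "these",
    "outa": "those",
}
# Per the text, 'habi' may also be used before the stem.
PREFIX_MARKERS = {"habi": "all"}


def split_plural_marker(word: str) -> Optional[Tuple[str, str, str]]:
    """Return (stem, marker, sense) if the word carries a listed marker, else None."""
    # Try the longest markers first so a longer marker is never shadowed by a shorter one.
    for marker in sorted(SUFFIX_MARKERS, key=len, reverse=True):
        if word.endswith(marker) and len(word) > len(marker):
            return word[: -len(marker)], marker, SUFFIX_MARKERS[marker]
    for marker, sense in PREFIX_MARKERS.items():
        if word.startswith(marker) and len(word) > len(marker):
            return word[len(marker):], marker, sense
    return None


if __name__ == "__main__":
    for w in ["manuhabi", "habimanu", "manueta", "manu"]:
        print(w, split_plural_marker(w))
```

Because the match is purely on character strings, any word that happens to end in one of these sequences would be split as well; the sketch shows the shape of the rule rather than a reliable analyser.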


Case suffixes: 


A. Nominative: The nominative related to an
intransitive verb generally does not take any ending. The
terminations -4| (e), -¥ (j) and -@ (ue), -& (je) are
generally added to the nominatives.

E.g. - 4X4 OS cA (puoi b'at k"eil- : Purna took
his meals)


B. Accusative: Objects indicating inanimate things
generally do not take any termination. Objects
indicating animals take the terminations -C4 (te) and
-ACA (ore).
E.g. - ft ae DISA (mi mantuse sausi : I am looking
at Mantu)


C. Instrumental: The instrumental termination -7 (1)
is added to the accusative form of the noun. Since no
termination is added in the accusative in the case of
words indicating non-living things and lower animals,
the affix -*T (1) is added directly to the stem. In the
case of words indicating higher animals, -*1 (1) is added
to the accusative form in -C4 (1e) or AC (ore).

E.g. - CGA (muselo : with me)

D. Dative: -C4 (1e), ACs (ore), -<8 (tan) 

E.g. — lc & (muse de : give me) 

E. Ablative: 8 (to), a8 Fkf-se (tts), -awe (etto),
-XS (santo)

E.g. - ates (musanto : from me)

F. Genitive: The terminations are - (1) and 44 (01).
The affix - 4 (1) becomes - 4 (01) when the stem
ends in a consonant or has its final vowel unpronounced.

E.g. - lt (mux : my)

G. Locative: -S (t), -4 (e), -T% (san) 
E.g. - Wl (g"oue : in the house) 

H. Vocative: -'$ (u), -GAl (ru), -G (te) 
E.g. — “mica (dadazu : O elder brother) 


2) PRONOUN: 
The Pronoun for the First Person: 


A. Singular Form: a (mi) 

B. Plural Form: Sf (ami : we) 

C. There are two oblique forms, Cl (mu) and art
(ama), to which the inflexions and post-positions are
added to form various cases. CT (mu) is for the singular
forms and Sl (ama) is for the plural forms.

E.g. - Gl (mu) -> Gita (mure), CCHT (musel), carats
(muaan)

The Pronoun for the Second Person: 
A. Singular Form: FS (ti) 


B. Plural Form: of (tumi) 


C. There are oblique forms O9T (tu) and -@4l (tuma)
to which case inflexions and post-positions are added to
form various cases.


E.g. — OST (tu)-> COIta (ture), COTA (tusan) 

The Pronoun for the Third Person: 

A. Singular Form: ‘8! (ta) in masculine gender and 
C88 (tei) in feminine gender. 

B. Plural Form: Ital (tanu) or lJ (tanu) which is 
common to both masculine and feminine genders. 

C. The Proximate Demonstrative Pronoun: -4 (€) 


D. The Remote Demonstrative Pronoun: @ (ou) 


E. The Relative Pronoun: ¢&t (ge) 
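
As a small illustration of the direct and oblique pronoun forms listed above, the first- and second-person forms can be held in a lookup table and combined with a case ending. The romanized forms are those printed in this section; the ending `re` is inferred from the printed examples mure and ture, and the table layout and function name are assumptions for illustration.

```python
from typing import Dict, Tuple

# (person, number) -> direct and oblique forms, romanized as printed in the text.
PRONOUNS: Dict[Tuple[int, str], Dict[str, str]] = {
    (1, "sg"): {"direct": "mi", "oblique": "mu"},
    (1, "pl"): {"direct": "ami", "oblique": "ama"},
    (2, "sg"): {"direct": "ti", "oblique": "tu"},
    (2, "pl"): {"direct": "tumi", "oblique": "tuma"},
}


def inflect(person: int, number: str, ending: str) -> str:
    """Attach a case ending to the oblique form of the requested pronoun."""
    return PRONOUNS[(person, number)]["oblique"] + ending


if __name__ == "__main__":
    # 're' is the ending seen in the printed examples 'mure' and 'ture'.
    print(inflect(1, "sg", "re"), inflect(2, "sg", "re"))
```

The sketch only concatenates the oblique stem and the ending; any sound changes at the boundary would have to be added as separate rules.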


3) VERB: 
There are three verb moods in Bishnupriya:
Indicative, Imperative and Subjunctive.


A. The Indicative Mood: 


Indicative Mood may be divided into Simple and 
Compound. 
The Simple Tense Suffixes: 


1. Simple Present: -4 (1), Off (uri), -@f% (oui), - 
Gal (ousi), - Bl (ouzi) 

E.g. — #4 (kor: to do) -> 24 (kom), PF (Kosuri) 

2. The Precative Present: -2¥ (itu), -208 (ite), 38
(it), -ZSI (itan), -Z61 (itai), -FSt (ita)


The initial -i of the endings is dropped after roots
ending in a consonant. Initial -i- of the endings plus the
final -1- of the root becomes -1-.


E.g. — 4 (kor : to do) -> P8¢ (kostu), FACS (kote), 
PAS (kot) 

3. Simple Past: -23 (ilu), -2061 (ile), -24T (ilo), -S-7 
(ilan), -271& (ilaj), -2e1 (ila) 

E.g. — 1 (pa : to get) -> CA8% (peilu), C7180 (peile), 
CARH (peilo) 

4. Simple Future: -305t (itou), -3068 (itei), 301
(itau), -SSIwIES (itanai), -LoIae (itarai), -TSlS (itai)

E.g. - & (k’a : to eat) -> CABOST (kraitou), CABTSS
(kraitei), CDOS (k*aitau)

5. Probable Future: -24 (im), -204 (ibe), -34 (ibo), - 
Bale (iban), -2212 (ibaj), S21 (iba) 

E.g. - 2 (t's : to keep) -> AST (t*oim), AS (t*oibe),
Bq (t*oibo)


The Compound Tense Suffixes: 

1. Present Progressive: There is no special form for
the present progressive tense. Its meaning is carried by
the combination of the non-finite form of the principal
verb in -2&1 (ija) or AS (at) / -4NS (nat) and the simple
present tense of the root -S1R (af).

E.g. - 4 (kor: to do) -> Saat AY (kosija afu)

2. Present Perfect: -48 (efu), -4@ (efot), -C&
(efe), -f (efi), -1B (ef)

The initial -4 (e) of the endings plus the final (fll
and -St (a) of the root becomes -3(9) and -at (a)
respectively.

E.g. - & (t's : to keep) -> XB (tofu)

3. Past Progressive: Its meaning is carried by the
combination of the non-finite form of the principal verb
in -221 (ija) or SNS (at) / -4NS (nat) and the simple past
tense form of the root -S1& (af).

E.g. - A (pi : to drink) -> fat atfey (piya afilu)

4. Past Perfect: FREI (efilu), FREI (efile), -2 faq
(ifil), -Sfete (ifilan)

E.g. - "ll (pa : to get) -> "2 (pafile)

5. Probable Past: It is formed by the combination of
the non-finite form of the principal verb in -28 (ija) or
its contracted form -e and the probable future form of
the root -2t (tha : to remain).

E.g. - 4 (kor: to do) -> PA 2B (kore t*aim)

6. Future Progressive: Its meaning is carried by the
combination of the non-finite form of the principal verb
in -33t (ija) or AS (ot) / -ATS (nat) and the simple
future form of the root -2tt (t*a).

E.g. - 84 (kor : to do) -> Sfeat WBCSt (kosija t'aitou)

B. The Imperative Mood: 


The present imperative mood has the following endings: 
-38 (in), SI (0), AF (ok), -3% (ik), AF (oka) 
E.g. - ®4 (kor : to do) -> @f88 (kosin) 


C. The Subjunctive Mood: 


There is no special form for the subjunctive mood. Its
meaning is carried with the help of words such as aiste
and jadi, the first for the past subjunctive and the second
for the future subjunctive.


V. FREQUENCY ANALYSIS 


From the corpus, we have collected approx.
twenty-five thousand inflected words.


The 20 highest-frequency words of the corpus are
shown below:


TABLE I. First twenty frequent words from the corpus

Word      IPA       Meaning            Frequency
Aca       baso      and, again         652
Aral]     ahan      one                644
aia       bai       things             641
arts      amal      our                528
as        nei       absent             405
wold      tai       his                400
at        na        negative prefix    392
arc       loge      with               369
oat       oja       becoming           366
gferat    bulija    having said        333
eae       kosija    having done        303
Pcs       kore      having done        302
fret      dija      giving             297
4atFq     ehan      this               290
ala       al        other              266
coe       tei       she                262
aa        habi      all                256
APA       okiolo    started            254
wy        manu      man                254
aCe       afe       remain             250

VI. CONCLUSION AND FUTURE WORKS 


In this paper, we have discussed the development of
a text-based corpus for the Bishnupriya Manipuri
language. Owing to a lack of awareness, this language,
like several other Indian languages, is studied less
frequently. As a result, the language still lacks a good
corpus and basic language processing tools. We are
unaware of any other corpus development work on the
Bishnupriya Manipuri language. Our main motivation is
to develop resources for linguistic work on this
language.


In future, we will further increase the size of this
corpus and add more sections to it. We are also planning
to develop language processing tools for this language.
One interesting morphological feature of this language
is word formation with Tibeto-Burman Meitei roots.
Studies of such special features remain to be done in
the future.